Thank you for looking into my project.
I have worked on it locally.
And sorry to say that the platform does not have an option to load a file without submitting it.
So I did not actually run the code at the platform.
And I will only be able to do that once the project is returned to me by you.
It should work, but I cannot promise that.
Thank you
The purpose of this research is to give supervisors from the virtual telephony service CallMeMaybe
information on the least effective operators.
3.1 Are there any duplicated values? If so, what's their nature?
The client CallMeMaybe has asked for information regarding effectiveness of operators according to:
For each subject an operator will receive a grate for least effective.
Meaning, the best score will be 0 - effective.
And the worst score will be 3 - least effective.
The max total of available points will be 12.
Operators will be sorted by grade is two ways:
The purpose of this task is to supply the virtual telephony service CallMeMaybe information regarding least effective operators.
• Read the CSV files
• Check columns types
• Change columns name
• Change columns types
• Check for duplicates
• How much data was lost ?
• Check for missing values
• How many operators are there ?
• How big is the data ?
• What are the averages for operators ?
• Visualize distributions.
• Find percentiles of top and bottom operators.
• Look into distributions for clients with different tariff.
• Check amount of missed incoming calls, internal and external.
• Check long waiting time for incoming calls.
• Check if an operator is supposed to make outgoing calls, how many were made.
Form hypotheses regarding the duration of the outgoing calls.
Are there differences between different days of the week ?
Are there differences between operators with different grades ?
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
import math
import numpy as np
from scipy import stats as st
from plotly import graph_objects as go
pd.options.display.max_columns = None
pd.options.display.max_colwidth = 200
pd.options.display.float_format = "{:,.2f}".format
def df_info(df):
'''
show information regarding the DF.
'''
display(df.head())
print('shape:', df.shape)
print()
print('******** info() ********')
print(df.info())
print()
print('******** info(memory_usage) ********')
print(df.info(memory_usage = 'deep'))
print()
print('******** describe() ********')
print(df.describe())
print()
print('******** describe(include) ********')
print(df.describe(include = ['object']))
def fix_is_missed_call(row):
'''
find rows with a combination of TRUE for missed call with a duration > 0.
and change the value into FALSE
'''
is_missed_call = row['is_missed_call']
s = str(is_missed_call)
call_duration = row['call_duration']
if (s == 'True'):
if(call_duration > 0):
return False
return is_missed_call
def in_missed_calls_month(row):
'''
rate for in_missed_calls_month
'''
is_missed_call = row['is_missed_call']
if is_missed_call >= 3:
return 3
if is_missed_call >= 2:
return 2
if is_missed_call >= 1:
return 1
return 0
def in_missed_calls(row):
'''
rate for in_missed_calls
'''
is_missed_call = row['is_missed_call']
if is_missed_call >= 4:
return 3
if is_missed_call >= 3:
return 2
if is_missed_call >= 2:
return 1
return 0
def out_AVG_wait(row):
'''
rate out_AVG_wait
'''
AVG_wait_duration = row['AVG_wait_duration']
if AVG_wait_duration > 30:
return 3
if AVG_wait_duration > 25:
return 2
if AVG_wait_duration > 20:
return 1
return 0
def in_AVG_wait(row):
'''
rate in_AVG_wait
'''
wait_duration = row['AVG_wait_duration']
if wait_duration > 32:
return 3
if wait_duration > 21:
return 2
if wait_duration > 15:
return 1
return 0
def out_calls_cnt_month(row):
'''
rate out_calls_cnt_month
'''
calls_count = row['calls_count']
if calls_count < 10:
return 3
if calls_count < 25:
return 2
if calls_count < 50:
return 1
return 0
def out_calls_cnt(row):
'''
rate out_calls_cnt
'''
calls_count = row['calls_count']
if calls_count < 30:
return 3
if calls_count < 100:
return 2
if calls_count < 140:
return 1
return 0
def sum_rate_month(row):
'''
sum up all the monthly ratings
'''
rate_in_missed_call_month = row['rate_in_missed_call_month']
rate_OUT_AVG_wait_duration_month = row['rate_OUT_AVG_wait_duration_month']
rate_in_wait_duration_month = row['rate_in_wait_duration_month']
rate_out_calls_cnt_month = row['rate_out_calls_cnt_month']
rate = rate_in_missed_call_month + rate_OUT_AVG_wait_duration_month + rate_in_wait_duration_month + rate_out_calls_cnt_month
return rate
def sum_rate(row):
'''
sum up all the ratings
'''
rate_in_missed_call = row['rate_in_missed_call']
rate_OUT_AVG_wait_duration = row['rate_OUT_AVG_wait_duration']
rate_in_wait_duration = row['rate_in_wait_duration']
rate_out_calls_cnt = row['rate_out_calls_cnt']
rate = rate_in_missed_call + rate_OUT_AVG_wait_duration + rate_in_wait_duration + rate_out_calls_cnt
return rate
def print_pie(df_name, labels_field, values_field, title_text):
'''
create a pie chart
'''
fig = go.Figure(data=[go.Pie(labels=df_name[labels_field],
values=df_name[values_field])])
fig.update_layout(
title= title_text,
autosize=False,
width=700,
height=700,
margin=dict(
l=50,
r=50,
b=100,
t=100,
pad=4
))
fig.show()
def histogram(df, col, title, xlable, ylable):
'''
print a historgram
'''
plt.figure(figsize=(15, 5))
plt.hist(df[col])
plt.title(title)
plt.xlabel(xlable)
plt.ylabel(ylable)
plt.show()
def histogram_range(df, col, title, xlable, ylable, range_low, range_high):
'''
print a historgram with range
'''
plt.figure(figsize=(15, 5))
plt.hist(df[col], range = (range_low, range_high))
plt.title(title)
plt.xlabel(xlable)
plt.ylabel(ylable)
plt.show()
def run_hypotheses(list1, list2):
'''
run H0 hypotheses, check if both lists are similar.
'''
#the significance level is 5%
alpha = 0.05
results = st.ttest_ind(list1, list2)
print('p-value: ', results.pvalue)
if (results.pvalue < alpha):
print("We reject the null hypothesis")
else:
print("We can't reject the null hypothesis")
def get_list_df(data1, data2, col):
'''
get a list from DF, with outgoig calls.
and run the hypotheses test.
'''
list1 = data1[(data1['direction'] == 'out') & (data1['is_missed_call'] == False)][col]
#print(list1.head())
list2 = data2[(data2['direction'] == 'out') & (data2['is_missed_call'] == False)][col]
#print(list2.head())
run_hypotheses(list1, list2)
Download the data and prepare it for analysis
clients = pd.read_csv('telecom_clients_us.csv', sep=',')
dataset = pd.read_csv('telecom_dataset_us.csv', sep=',')
df_info(clients)
Conclusions:
There are 732 rows.
There are no missing values.
tariff_plan should be a category. But using category with the platform creates problems. So will not change it.
date_start should be datetime.
df_info(dataset)
Conclusions:
There are 53902 rows.
There are missing values at: operator_id, internal fields.
Date should be datetime
Direction, should be a category. But using category with the platform creates problems. So will not change it.
operator_id should be int
internal should be Boolean
no need to change names
clients['date_start'] = pd.to_datetime(clients['date_start'], format = '%Y-%m-%d')
find timestamp range
time_max = clients['date_start'].max()
time_max
time_min = clients['date_start'].min()
time_min
Will create columns for: month, day, day of week.
clients['month_date_start'] = clients['date_start'].dt.month
clients['day_date_start'] = clients['date_start'].dt.day
clients['dayofweek_date_start'] = clients['date_start'].dt.dayofweek
clients.head()
remove time and zone from date
dataset['date'] = dataset['date'].map(lambda x: x[0:10] )
convert to datetime
dataset['date'] = pd.to_datetime(dataset['date'], format = '%Y-%m-%d')
find timestamp range
time_max = dataset['date'].max()
time_max
time_min = dataset['date'].min()
time_min
Conclusions:
The date range of the research is from: 02-Aug-2019 to 11-Nov-2019
Will create columns for: month, day, day of week.
dataset['month_date'] = dataset['date'].dt.month
dataset['day_date'] = dataset['date'].dt.day
dataset['dayofweek_date'] = dataset['date'].dt.dayofweek
in order to convert, we will need to take care of missing values first.
dataset[dataset['operator_id'].isnull()].head()
Conclusions:
since operator ID is missing, we cannot use those rows.
They will be deleted.
dataset.shape
dataset = dataset[~dataset['operator_id'].isnull()]
dataset.shape
dataset[dataset['operator_id'].isnull()]
8172 rows were deleted, there are no more mising values at operator_id.
convert operator_id to int.
dataset['operator_id'] = dataset['operator_id'].astype(int)
clients.duplicated().sum()
dataset.duplicated().sum()
dataset[dataset.duplicated()].head()
Conclusions:
There are 4184 duplicated rows.
Check the distribution of different values at the columns
for i in dataset[dataset.duplicated()].columns:
print(i, ':', dataset[dataset.duplicated()][i].nunique())
there are duplicated values at all of the columns.
lets look at the WHEN
dataset[dataset.duplicated()]['date'].dt.date.nunique()
there are 114 diferant dates.
histogram(dataset[dataset.duplicated()],\
'dayofweek_date', 'Distribution of duplicates by days of the week', 'days of the week', 'amount')
let's look at how calls are distributed generally
histogram(dataset, 'dayofweek_date', 'Distribution by days of the week', 'days of the week', 'amount')
Conclusions:
There is no apparent pattern to the duplicate rows.
The distributions for duplicated rows are similar to that of the general distribution.
This issue should be reported to the relevant department.
For now those rows will be deleted
dataset.drop_duplicates(inplace=True)
dataset.info()
there are missing values at internal,
lets look at those rows.
dataset[dataset['internal'].isnull()]['user_id'].count()
there are 55 rows with missing value at: internal
dataset[dataset['internal'].isnull()].head()
dataset[dataset['internal'].isnull()].shape
Lets look at the general distribution between internal and external calls
dataset['internal'].unique()
dataset[dataset['internal'] == True]['user_id'].count()
dataset[dataset['internal'] == False]['user_id'].count()
Since the majority of the calls are external (internal = False),
and there are only 55 rows with missing value, we will write them as False as well.
dataset['internal'] = dataset['internal'].fillna(False)
dataset.shape
dataset.info(memory_usage = 'deep')
Conclusions:
Initially there were 53902 rows.
There are no missing values.
After removal of duplicated rows there are now 41546 rows.
all togther 12356 rows were removed.
dataset['operator_id'].nunique()
there are 1092 operators
the size of the dara set has changed from: 11.1 MB
to 5.5 MB
there are some error in the date, such as:
dataset[(dataset['user_id'] == 166405) & (dataset['is_missed_call'] == True) & (dataset['direction'] == 'in')]
row 1606 is marked as a missed call, yet it has duration.
the developers need to be informed of this.
for now will change such lines to the value of FALSE at is_missed_call
dataset['is_missed_call'] = dataset.apply(fix_is_missed_call, axis = 1)
create a list of months from the dataset
months = dataset['month_date'].unique()
months
Calculate the averages and grade operators for:
number of missed incoming calls.
waiting time for incoming calls.
amount of out going calls.
wait duration of out going calls.
dataset['wait_duration'] = dataset['total_call_duration'] - dataset['call_duration']
what was the average duration of a call
dataset['AVG_wait_duration'] = (dataset['wait_duration'] / dataset['calls_count']).astype(int)
join with clients
dataset = pd.merge(dataset, clients, left_on = 'user_id', right_on = 'user_id', how = 'left', sort = False)
dataset.head()
split the dataset to: in and out
dataset_in = dataset[dataset['direction'] == 'in']
dataset_out = dataset[dataset['direction'] == 'out']
An operator is considered ineffective if they have a large number of missed incoming calls.
dataset_in.head()
dataset_in.shape
find out how many missed calls each operator has by month
dataset_in_missed_calls_month = dataset_in \
.pivot_table(index = ('operator_id', 'month_date'), values = 'is_missed_call', aggfunc = 'sum').reset_index()
dataset_in_missed_calls_month.head()
find out how many missed calls each operator has by full term
dataset_in_missed_calls = dataset_in \
.pivot_table(index = ('operator_id'), values = 'is_missed_call', aggfunc = 'sum').reset_index()
dataset_in_missed_calls.head()
by month
for month in months:
data = dataset_in_missed_calls_month[dataset_in_missed_calls_month['month_date'] == month]
label = ('Distribution of missed IN calls for month ' + str(month))
histogram(data, 'is_missed_call', label, 'missed calls', 'operators')
print('the percentile for 10, 5, 1 percent are:')
print(np.percentile(data['is_missed_call'], [90, 95, 99]))
for the full term
label = ('Distribution of missed calls for full term')
histogram(dataset_in_missed_calls, 'is_missed_call', label, 'missed calls', 'operators')
print('the percentile for 15, 10, 5, 1 percent are:')
print(np.percentile(dataset_in_missed_calls['is_missed_call'], [90, 95, 99]))
Conclusions:
only 10% have more then 1 missed calls at the full term.
and most operators have up to one missed call in a month.
for monthly:
up to 3 will be level 3.
up to 2 will be level 2.
up to 1 will be level 1.
for full time:
up to 4 will be level 3.
up to 3 will be level 2.
up to 2 will be level 1.
rate the operators:
dataset_in_missed_calls_month['rate_in_missed_call_month'] = dataset_in_missed_calls_month.apply(in_missed_calls, axis = 1 )
dataset_in_missed_calls['rate_in_missed_call'] = dataset_in_missed_calls.apply(in_missed_calls, axis = 1 )
label = 'general distribution AVG wait duration of outgoing calls'
histogram(dataset_out, 'AVG_wait_duration', label, 'wait duration[sec]', 'operators')
dataset_out_referance_AVG_wait = dataset_out[dataset_out['AVG_wait_duration'] < 80]
dataset_out_referance_AVG_wait.boxplot(column=['AVG_wait_duration'])
plt.title('AVG_wait_duration')
plt.show()
Find out what is the average wait duration for outgoing calls by month
dataset_out_AVG_wait_month = dataset_out_referance_AVG_wait \
.pivot_table(index = ('operator_id', 'month_date'), values = 'AVG_wait_duration', aggfunc = 'mean').reset_index()
dataset_out_AVG_wait_month.head()
Find out what is the average wait duration for outgoing calls full term
dataset_out_AVG_wait = dataset_out_referance_AVG_wait. \
pivot_table(index = ('operator_id'), values = 'AVG_wait_duration', aggfunc = 'mean').reset_index()
dataset_out_AVG_wait.head()
by month
for month in months:
data = dataset_out_AVG_wait_month[dataset_out_AVG_wait_month['month_date'] == month]
label = ('Distribution of wait duration of OUT calls for month ' + str(month))
histogram(data, 'AVG_wait_duration', label, 'wait duration[sec]', 'operators')
print('the percentile for 10, 5, 1 percent are:')
print(np.percentile(data['AVG_wait_duration'], [90, 95, 99]))
full term
histogram(dataset_out_AVG_wait, 'AVG_wait_duration', 'Distribution of wait duration of OUT calls for full term', \
'wait duration[sec]', 'operators')
print('the percentile for 10, 5, 1 percent are:')
print(np.percentile(dataset_out_AVG_wait['AVG_wait_duration'], [90, 95, 99]))
dataset_out_AVG_wait.boxplot(column=['AVG_wait_duration'])
plt.title('AVG wait duration')
plt.show()
print(np.percentile(dataset_out_AVG_wait['AVG_wait_duration'], [90, 95, 99]))
duration of more them 40 seconds is an outlayer
dataset_out_AVG_wait[dataset_out_AVG_wait['AVG_wait_duration'] < 40]['AVG_wait_duration'].describe()
conclusion:
only 10% have of operator have a wait duration of more then 22 seconds.
with the average being 15 seconds.
and with 75% of operators wait up to 19 seconds.
only 5% wait longer then 25 seconds.
we should consider a wait duration of more then 20 seconds as long-1.
and above 25 as long-2
and above 30 as long-3
rate the operators:
dataset_out_AVG_wait_month['rate_OUT_AVG_wait_duration_month'] = dataset_out_AVG_wait_month.apply(out_AVG_wait, axis = 1 )
dataset_out_AVG_wait['rate_OUT_AVG_wait_duration'] = dataset_out_AVG_wait.apply(out_AVG_wait, axis = 1 )
An operator is considered ineffective if they have a long waiting time for incoming calls.
Find out what is the average wait duration for incomin calls by month
dataset_in_AVG_wait_month = dataset_in \
.pivot_table(index = ('operator_id', 'month_date'), values = 'AVG_wait_duration', aggfunc = 'mean').reset_index()
dataset_in_AVG_wait_month.head()
Find out what is the average wait duration for incomin calls full term
dataset_in.head()
dataset_in_AVG_wait = dataset_in \
.pivot_table(index = 'operator_id', values = 'AVG_wait_duration', aggfunc = 'mean').reset_index()
dataset_in_AVG_wait.head()
by months
for month in months:
data = dataset_in_AVG_wait_month[dataset_in_AVG_wait_month['month_date'] == month]
label = ('Distribution of AVG wait duration of IN calls for month ' + str(month))
histogram(data, 'AVG_wait_duration', label, 'AVG wait duration[sec]', 'operators')
print('the percentile for 10, 5, 1 percent are:')
print(np.percentile(data['AVG_wait_duration'], [90, 95, 99]))
full term
histogram(dataset_in_AVG_wait, 'AVG_wait_duration', \
'Distribution of AVG wait duration of IN calls for full term', 'wait duration[sec]', 'operators')
print('the percentile for 10, 5, 1 percent are:')
print(np.percentile(dataset_in_AVG_wait['AVG_wait_duration'], [90, 95, 99]))
we will considere above 60 seconds of wait to be an outlayer.
histogram_range(dataset_in_AVG_wait, 'AVG_wait_duration', \
'Distribution of AVG wait duration of IN calls for full term', 'wait duration[sec]', 'operators', 0, 60)
dataset_in_AVG_wait[dataset_in_AVG_wait['AVG_wait_duration'] <= 60].boxplot(column=['AVG_wait_duration'])
plt.title('AVG wait duration')
plt.show()
print(np.percentile(dataset_in[dataset_in['AVG_wait_duration'] <= 60]['AVG_wait_duration'], [90, 95, 99]))
dataset_in[dataset_in['AVG_wait_duration'] <= 60]['AVG_wait_duration'].describe()
Conclusions:
Only 10% have a wait duration for incoming calls of more than 32 seconds.
With 75 % wait less than 21 seconds
With the average at 13 seconds
after also looking at the distributions:
above 15 will be long 1.
above 21 will be long 2.
above 32 will be long 3.
rate the operators:
dataset_in_AVG_wait_month['rate_in_wait_duration_month'] = dataset_in_AVG_wait_month.apply(in_AVG_wait, axis = 1 )
dataset_in_AVG_wait['rate_in_wait_duration'] = dataset_in_AVG_wait.apply(in_AVG_wait, axis = 1 )
if an operator is supposed to make outgoing calls, a small number of them is also a sign of ineffectiveness.
find out how many out calls were made
by month
dataset_out_calls_cnt_month = dataset_out \
.pivot_table(index = ('operator_id', 'month_date'), values = 'calls_count', aggfunc = 'sum').reset_index()
dataset_out_calls_cnt_month.head()
full term
dataset_out_calls_cnt = dataset_out \
.pivot_table(index = ('operator_id'), values = 'calls_count', aggfunc = 'sum').reset_index()
dataset_out_calls_cnt.head()
per month
for month in months:
data = dataset_out_calls_cnt_month[dataset_out_calls_cnt_month['month_date'] == month]
label = ('Distribution of OUT going calls for month ' + str(month))
histogram(data, 'calls_count', label, 'out going calls', 'operators')
print('the percentile for 10, 5, 1 percent are:')
print(np.percentile(data['calls_count'], [90, 95, 99]))
since we are looking for inafective, we will look at operators with a small amount of out calls.
for month in months:
data = dataset_out_calls_cnt_month[dataset_out_calls_cnt_month['month_date'] == month]
label = ('Distribution of OUT going calls for month ' + str(month))
histogram_range(data, 'calls_count', label, 'out going calls', 'operators', 0, 200)
print('the percentile for 10, 5, 1 percent are:')
print(np.percentile(dataset_out_calls_cnt_month[(dataset_out_calls_cnt_month['calls_count'] <= 200) \
& (dataset_out_calls_cnt_month['month_date'] == month)]\
['calls_count'], [10, 5, 1]))
the precintail are very low, so accorind to the destbution lower then 20 is low-1 and lower then 10 is low-2
full term
histogram(dataset_out_calls_cnt, 'calls_count', 'Distribution of OUT going calls for full term', 'calls count', 'operators')
we will look at the range of up to 500 calls.
histogram_range\
(dataset_out_calls_cnt, 'calls_count', 'Distribution of OUT going calls for full term', 'calls count', 'operators', 0, 500)
dataset_out_calls_cnt[dataset_out_calls_cnt['calls_count'] < 500].boxplot(column=['calls_count'])
plt.title('calls out count')
plt.show()
dataset_out_calls_cnt[dataset_out_calls_cnt['calls_count'] < 500]['calls_count'].describe()
print(np.percentile(dataset_out_calls_cnt[dataset_out_calls_cnt['calls_count'] < 500]['calls_count'], [10, 5, 1]))
low-1 will be below 30, low-2 will be below 15
Conclusions:
the precintail are very low, so according to the distribution:
for monthly:
below 50 will be 1
below 25 will be 2
below 10 will be 3
for full term:
below 140 will be 1
below 100 will be 2
below 30 will be 1
rate the operators:
dataset_out_calls_cnt_month['rate_out_calls_cnt_month'] = dataset_out_calls_cnt_month.apply(out_calls_cnt_month, axis = 1)
dataset_out_calls_cnt['rate_out_calls_cnt'] = dataset_out_calls_cnt.apply(out_calls_cnt, axis = 1)
merge all the results togther
merge: dataset_in_missed_calls_month with dataset_out_AVG_wait_month
rate_data_month = pd.merge(dataset_in_missed_calls_month[['operator_id', 'month_date', 'rate_in_missed_call_month']],\
dataset_out_AVG_wait_month[['operator_id','month_date', 'rate_OUT_AVG_wait_duration_month']],\
on=('operator_id', 'month_date'), how='left')
#rate_data_month.head()
merge: dataset_in_AVG_wait_duration_month
rate_data_month = pd.merge(rate_data_month, dataset_in_AVG_wait_month \
[['operator_id','month_date', 'rate_in_wait_duration_month']],\
on=('operator_id', 'month_date'), how='left')
#rate_data_month.head()
merge with: dataset_out_calls_cnt_month
rate_data_month = pd.merge(rate_data_month, dataset_out_calls_cnt_month \
[['operator_id','month_date', 'rate_out_calls_cnt_month']], \
on=('operator_id', 'month_date'), how='left')
#rate_data_month.head()
replace null with 0
rate_data_month = rate_data_month.fillna(0)
sum up the different ratings:
rate_data_month['rate_month'] = rate_data_month.apply(sum_rate_month, axis = 1)
#rate_data_month.head()
cast values to int
rate_data_month['rate_month'] = rate_data_month['rate_month'].astype(int)
rate_data_month.sort_values(by='rate_month')[['operator_id','month_date', 'rate_month']].tail(10)
show the grade for each operator by month
rate_month = rate_data_month.pivot_table\
(index = 'operator_id', columns = 'month_date', values = 'rate_month', aggfunc = 'sum').fillna(0).reset_index()
rate_month = rate_month.astype(int)
rate_month.head()
merge: dataset_in_missed_calls with dataset_out_AVG_wait
rate_data = pd.merge(dataset_in_missed_calls[['operator_id', 'rate_in_missed_call']],\
dataset_out_AVG_wait[['operator_id', 'rate_OUT_AVG_wait_duration']],\
on=('operator_id'), how='left')
#rate_data.head()
merge: dataset_in_AVG_wait_duration
rate_data = pd.merge(rate_data, dataset_in_AVG_wait[['operator_id', 'rate_in_wait_duration']],\
on=('operator_id'), how='left')
#rate_data.head()
merge with: dataset_out_calls_cnt_month
rate_data = pd.merge(rate_data, dataset_out_calls_cnt[['operator_id', 'rate_out_calls_cnt']],\
on=('operator_id'), how='left')
#rate_data.head()
rate_data = rate_data.fillna(0)
sum up all the different ratings:
rate_data['rate'] = rate_data.apply(sum_rate, axis = 1)
#rate_data.head()
rate_data['rate'] = rate_data['rate'].astype(int)
rate_data[['operator_id', 'rate']].head()
how long will the operator list be ?
#user can modify this value, up to 50.
operators_on_list_month = 10
operators_on_list_full_term = 30
for col in rate_month.columns:
if col == 'operator_id':
continue
label = 'Distribution of least effective operators for month ' + str(col)
histogram(rate_month, col, label, 'least effective grade', 'operators')
data = rate_month[['operator_id', col]].sort_values(by = col, ascending = False)
print(data.head(operators_on_list_month))
label = 'Distribution of score of least effective operators for full term'
histogram(rate_data, 'rate', label, 'least effective grade', 'operators')
data = rate_data[['operator_id', 'rate']].sort_values(by = 'rate', ascending = False)
print(data.head(operators_on_list_full_term))
group_tariff = clients.groupby('tariff_plan').agg({'user_id':'count'}).reset_index()
group_tariff
title_text = 'proportions of users by tariff plan'
print_pie(group_tariff, 'tariff_plan', 'user_id', title_text)
Conclusion:
It seems that taarif A is the most expansive.
Then B.
And taarif C is the least expensive.
dataset_out_referance_AVG_wait.pivot_table(index = 'tariff_plan', values = 'wait_duration', aggfunc = ('mean', 'median'))
Conclusion:
it seems that the operators will wait longer for someone to pick up on the other side with calls for client A.
dataset_in[dataset_in['is_missed_call'] == True]\
.pivot_table(index = 'tariff_plan', values = 'is_missed_call', aggfunc = ('count'))
dataset_in.pivot_table(index = 'tariff_plan', values = 'wait_duration', aggfunc = ('mean', 'median'))
Conclusion:
it seems that the operators have more IN missing calls and also longer duration time of calls from clients with taarif B.
Are the call durations for outgoing calls similar on all days ?
H0 call duration of outgoing calls is similar between days.
H1 call duration of outgoing calls is different between days.
are the the weekend days similar to other days ?
weekend days:
dataset_5_6 = dataset[dataset['dayofweek_date'] >= 5]
#dataset_5_6.head()
other days:
dataset_0_4 = dataset[dataset['dayofweek_date'] <= 4]
#dataset_0_4.head()
get_list_df(dataset_5_6, dataset_0_4, 'call_duration')
Conclusions:
Call duration is different between weekend days and other days.
are the weekend days similar to weekend days ?
dataset_5 = dataset[dataset['dayofweek_date'] == 5]
dataset_6 = dataset[dataset['dayofweek_date'] == 6]
get_list_df(dataset_5, dataset_6, 'call_duration')
Conclusions:
Call duration between weekend days is similar.
are the other days similar to other days ?
n = 4
for i in range(n-1):
for j in range(i, n-1):
print('*******************')
print('for days: {} and {}'.format(i+1, j+2))
dataset_i = dataset[dataset['dayofweek_date'] == i]
dataset_j = dataset[dataset['dayofweek_date'] == j]
get_list_df(dataset_i, dataset_j, 'call_duration')
Conclusions:
Call duration between other days is similar.
Is there a statistical difference of outgping calls duration between operators with high score – above 3 including.
And with those who have a low score - below 3.
H0 call duration of outgoing calls is similar between groups of higt and low grade.
H1 call duration of outgoing calls is different between groups of higt and low grade.
for col in rate_month.columns:
if col == 'operator_id':
continue
operator_high = rate_month[rate_month[col] >= 3]['operator_id']
operator_low = rate_month[rate_month[col] < 3]['operator_id']
call_duration_high = dataset[(dataset['month_date'] == col) & \
(dataset['operator_id'].isin(operator_high))\
& (dataset['is_missed_call'] == False)]['call_duration']
call_duration_low = dataset[(dataset['month_date'] == col) & \
(dataset['operator_id'].isin(operator_low)) &\
(dataset['is_missed_call'] == False)]['call_duration']
print('*********************')
print('for month:', col)
run_hypotheses(call_duration_high, call_duration_low)
Conclusion:
out of 4 months, at 3 of them there is a statistical difference between the duration of the outgoing calls between operators with high and low grade.
We have managed to calculate using a grading system a score for every operator to determine least effectiveness.
The higher the score, the less effective is the operator.
Max score is 12.
We have creates a list on a monthly basis, and a list for the full term, with the top least effective.
We have also concluded that the call duration of the outgoing calls is similar between the other days.
And also similar between the weekend days.
Yet the duration is different between the weekend days and the other days.
The operators seem to favor calls of clients with taarif A.
And the disfavor clients with taarif B.
A list of articles and the questions they helped me answer:
link: https://www.nextiva.com/blog/what-is-a-virtual-phone-system.html
Why is it a good idea for a business to use virtual phones ?
What are the benefits ?
What are the top features of a virtual phone system?
link: https://talkroute.com/how-does-a-virtual-phone-system-work/
How Technology has changed our view of telephones ?
3.Benefits of using a Virtual Phone Number for your Business
link: https://tweakyourbiz.com/management/customer-service/virtual-phone-benefits
What are the benefits of using a Virtual Phone Number for you’re a business ?
link: https://clutch.co/bpo/virtual-assistant/resources/how-improve-telephone-customer-service
How important is it for a clients to have a short wait duration ?
Who important is it to have a short conversation ?